Importing the libraries and data set

Create a copy of this subset so that later changes do not affect the original data or raise errors
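A minimal sketch of why the explicit copy helps, assuming pandas is used. The frame `raw` and its values are hypothetical stand-ins for the loaded weather data, not the actual data set.

```python
import pandas as pd

# Hypothetical stand-in for the loaded weather data
raw = pd.DataFrame({"MinTemp": [13.4, 7.4, 12.9], "Rainfall": [0.6, 0.0, 0.0]})

# .copy() gives the subset its own data, so editing it neither changes `raw`
# nor triggers pandas' SettingWithCopyWarning
subset = raw[raw["Rainfall"] == 0.0].copy()
subset["MinTemp"] = subset["MinTemp"] + 1.0

print(raw["MinTemp"].tolist())     # original values are untouched
print(subset["MinTemp"].tolist())  # only the copy was modified
```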

1. Exploratory Data Analysis

Details of all numerical columns (continuous variables)

MinTemp

MaxTemp

Rainfall

Evaporation

Sunshine

WindGustSpeed

WindSpeed9am

WindSpeed3pm

Humidity9am

Humidity3pm

Pressure9am

Pressure3pm

Cloud9am

Cloud3pm

Temp9am

Temp3pm

Details of all categorical variables

To show number of missing values

As shown above, there are missing values in every column. However, some columns are missing a substantial share of their values, such as Sunshine, where almost half (47.7%) of the values are missing.
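The missing-value counts and percentages can be tabulated along these lines. This is a sketch on a tiny made-up frame; the column names match the data set but the values, and the name `missing`, are illustrative only.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; values are made up
df = pd.DataFrame({
    "Sunshine": [np.nan, 7.2, np.nan, 9.1],
    "Rainfall": [0.0, 1.2, np.nan, 0.0],
    "Location": ["Albury", "Albury", "Sydney", "Sydney"],
})

# Count and percentage of missing values per column, worst first
missing = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": (df.isna().mean() * 100).round(1),
}).sort_values("pct_missing", ascending=False)

print(missing)
```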

Create a histogram for each categorical variable

Create a histogram for each continuous variable

Scatter plot for all continuous variables

Calculate correlation of all pairs of continuous variables
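One way to compute the pairwise correlations is sketched below, assuming Pearson correlation (the notebook does not say which coefficient it uses). The values are made up for illustration.

```python
import pandas as pd

# Illustrative values only
df = pd.DataFrame({
    "MinTemp": [10.0, 12.0, 14.0, 16.0],
    "MaxTemp": [20.0, 24.0, 28.0, 32.0],
    "Humidity3pm": [80.0, 60.0, 70.0, 40.0],
    "Location": ["Albury", "Albury", "Sydney", "Sydney"],
})

# Keep only numeric columns, then compute the pairwise correlation matrix
corr = df.select_dtypes(include="number").corr(method="pearson")

print(corr.round(2))
```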

Chi square for all pairs of categorical variables
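The test for one pair of categorical variables might look like the sketch below, which compares the statistic with the critical value as the notes do. It assumes SciPy is available and a 5% significance level (the notebook does not state the level used). The counts are invented.

```python
import pandas as pd
from scipy.stats import chi2, chi2_contingency

# Invented data in which the two wind directions usually agree
df = pd.DataFrame({
    "WindGustDir": ["N"] * 30 + ["S"] * 30,
    "WindDir9am": ["N"] * 25 + ["S"] * 5 + ["N"] * 5 + ["S"] * 25,
})

# Contingency table of observed counts
table = pd.crosstab(df["WindGustDir"], df["WindDir9am"])

# Chi-square test of independence
stat, p_value, dof, expected = chi2_contingency(table)

alpha = 0.05                         # assumed significance level
critical = chi2.ppf(1 - alpha, dof)  # boundary of the rejection region

reject_independence = stat > critical
print(f"statistic={stat:.2f}, critical={critical:.2f}, reject={reject_independence}")
```

Looping over `itertools.combinations` of the categorical column names would repeat the same test for every pair.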

As the chi-square statistic is more extreme than the critical value, i.e. it lies in the rejection region, the assumption of independence of the Location and WindGustDir variables can be rejected.

As the chi-square statistic is more extreme than the critical value, i.e. it lies in the rejection region, the assumption of independence of the Location and WindDir9am variables can be rejected.

As the chi-square statistic is more extreme than the critical value, i.e. it lies in the rejection region, the assumption of independence of the Location and WindDir3pm variables can be rejected.

As the chi-square statistic is more extreme than the critical value, i.e. it lies in the rejection region, the assumption of independence of the WindGustDir and WindDir9am variables can be rejected.

As the chi-square statistic is more extreme than the critical value, i.e. it lies in the rejection region, the assumption of independence of the WindGustDir and WindDir3pm variables can be rejected.

As the chi-square statistic is more extreme than the critical value, i.e. it lies in the rejection region, the assumption of independence of the WindDir9am and WindDir3pm variables can be rejected.

Explore our target variable 'RainTomorrow'

Relationship between Location and whether it will rain tomorrow
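A row-normalised cross-tabulation is one simple way to look at this relationship. The sketch below uses invented rows; `rain_rate` is a hypothetical name.

```python
import pandas as pd

# Invented rows for two locations
df = pd.DataFrame({
    "Location": ["Albury"] * 4 + ["Sydney"] * 4,
    "RainTomorrow": ["No", "No", "No", "Yes", "No", "Yes", "Yes", "Yes"],
})

# Share of "Yes" and "No" days within each location (each row sums to 1)
rain_rate = pd.crosstab(df["Location"], df["RainTomorrow"], normalize="index")

print(rain_rate)
```

The same call works for any other categorical column by swapping the first argument.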

Relationship between WindGustDir and whether it will rain tomorrow

2. Data Preparation

Handling of missing data

Columns with missing data: 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm'

Columns without missing data: 'Location', 'RainToday', 'RainTomorrow', 'Year', 'Month', 'Day'
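The notes do not say how the missing values are filled. As one common option, and purely as an assumption, the sketch below imputes the median for numeric columns and the most frequent value for categorical ones. The data are invented.

```python
import numpy as np
import pandas as pd

# Invented data with one gap in a numeric and one in a categorical column
df = pd.DataFrame({
    "Sunshine": [np.nan, 7.0, 9.0, 11.0],
    "WindGustDir": ["N", None, "N", "S"],
})

imputed = df.copy()
for col in imputed.columns:
    if pd.api.types.is_numeric_dtype(imputed[col]):
        # Assumed strategy: median for continuous variables
        imputed[col] = imputed[col].fillna(imputed[col].median())
    else:
        # Assumed strategy: most frequent category for categorical variables
        imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(imputed)
```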

Dummy-encoding of categorical variables
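Dummy encoding can be sketched with `pandas.get_dummies`. Dropping the first level of each variable (`drop_first=True`) is an assumption made here to avoid perfectly collinear columns; the notebook may keep every level.

```python
import pandas as pd

# Invented rows
df = pd.DataFrame({
    "MinTemp": [13.4, 7.4, 12.9, 9.2],
    "WindGustDir": ["N", "S", "E", "N"],
    "RainToday": ["No", "Yes", "No", "No"],
})

# One 0/1 column per category level, minus the first level of each variable
encoded = pd.get_dummies(
    df, columns=["WindGustDir", "RainToday"], drop_first=True, dtype=int
)

print(encoded)
```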

One-way ANOVA between each categorical variable and one continuous variable

We can now run the one-way ANOVA test, as the NaNs have been taken care of
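A one-way ANOVA between one categorical and one continuous variable could be run as in the sketch below, assuming SciPy's `f_oneway`. The choice of Location and MaxTemp, and the values, are illustrative. NaNs have to be removed or imputed first because `f_oneway` returns NaN if any group contains one.

```python
import pandas as pd
from scipy.stats import f_oneway

# Invented temperatures for three locations
df = pd.DataFrame({
    "Location": ["Albury"] * 4 + ["Sydney"] * 4 + ["Darwin"] * 4,
    "MaxTemp": [20.1, 21.3, 19.8, 20.6,
                24.9, 25.4, 26.1, 25.0,
                32.2, 31.8, 33.0, 32.5],
})

# Guard against NaNs, then build one array of values per category level
clean = df.dropna(subset=["Location", "MaxTemp"])
groups = [g["MaxTemp"].to_numpy() for _, g in clean.groupby("Location")]

# H0: all group means are equal
f_stat, p_value = f_oneway(*groups)
print(f"F={f_stat:.2f}, p={p_value:.4g}")
```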

The code below needs to be run twice because of an error on the first run

Standardisation of all continuous variables
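Standardisation here is taken to mean z-scoring, i.e. subtracting the mean and dividing by the standard deviation. That is an assumption about the notebook's method. The sketch uses plain pandas on invented values.

```python
import pandas as pd

# Invented values
df = pd.DataFrame({
    "MinTemp": [10.0, 12.0, 14.0, 16.0],
    "Humidity9am": [40.0, 60.0, 80.0, 100.0],
    "RainToday": ["No", "Yes", "No", "No"],
})

# Standardise only the numeric (continuous) columns
num_cols = df.select_dtypes(include="number").columns
standardised = df.copy()
standardised[num_cols] = (
    (df[num_cols] - df[num_cols].mean()) / df[num_cols].std(ddof=0)
)

print(standardised)
```

Using `ddof=0` matches the population standard deviation that scikit-learn's `StandardScaler` uses. If the data are split before scaling, the mean and standard deviation would normally be estimated on the training partition only.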

From the box plots and histograms above, we can see quite a number of potential outliers, especially in the Rainfall and Evaporation columns. We will confirm the outliers and cap them using the interquartile range (IQR) method.
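The interquartile range rule can be sketched as follows. The multiplier `k=1.5` is the conventional choice and an assumption here, since the notes do not state the multiplier used. The helper names `iqr_limits` and `cap_outliers` are hypothetical.

```python
import pandas as pd

def iqr_limits(s: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) limits Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def cap_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside the IQR limits to the nearest limit."""
    lower, upper = iqr_limits(s, k)
    return s.clip(lower=lower, upper=upper)

# Invented rainfall values with one extreme reading
rainfall = pd.Series([0.0, 0.0, 0.2, 0.4, 0.8, 1.0, 50.0], name="Rainfall")

lower, upper = iqr_limits(rainfall)
capped = cap_outliers(rainfall)

print(f"limits: ({lower:.2f}, {upper:.2f})")
print(capped.tolist())
```

When the lower limit falls below the column minimum (as with a variable that cannot be negative), only the upper limit has any effect, which is what "there is no lower limit" means in the per-column notes.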

For MinTemp, the minimum and maximum values are -8.2 and 30.3 respectively. Both lie within the IQR limits, so there are no outliers.

For MaxTemp, the minimum and maximum values are -3.1 and 48.1 respectively. Both lie within the IQR limits, so there are no outliers.

For Rainfall, the minimum and maximum values are 0 and 371 respectively, so the outliers are the values greater than 2.4. There is no lower limit.

For Evaporation, the minimum and maximum values are 0 and 86.2 respectively, so the outliers are the values greater than 9.6. There is no lower limit.

For WindGustSpeed, the minimum and maximum values are 6 and 126 respectively, so the outliers are the values greater than 91. There is no lower limit.

For WindSpeed9am, the minimum and maximum values are 0 and 130 respectively, so the outliers are the values greater than 48.4. There is no lower limit.

For WindSpeed3pm, the minimum and maximum values are 0 and 78 respectively, so the outliers are the values greater than 57. There is no lower limit.

For Humidity9am, the minimum and maximum values are 3 and 100 respectively. Both lie within the IQR limits, so there are no outliers.

For Pressure9am, the minimum and maximum values are 982 and 1040.4 respectively, so the outliers are the values less than 984.4. There is no upper limit.

For Pressure3pm, the minimum and maximum values are 977.1 and 1038.9 respectively, so the outliers are the values less than 981.6. There is no upper limit.

For Temp9am, the minimum and maximum values are -7 and 39 respectively, so the outliers are the values greater than 35 or less than -1.2.

For Temp3pm, the minimum and maximum values are -4.2 and 46.2 respectively. Both lie within the IQR limits, so there are no outliers.

Splitting of data into train and test partitions.
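A simple random split can be sketched with pandas alone. The 80/20 ratio and the seed are assumptions, since the notes do not give the proportions used. The frame is invented.

```python
import pandas as pd

# Invented frame with one feature and the target
df = pd.DataFrame({
    "Humidity3pm": range(10),
    "RainTomorrow": ["No", "Yes"] * 5,
})

# Assumed 80/20 split; random_state makes it reproducible
test = df.sample(frac=0.2, random_state=42)
train = df.drop(test.index)

# Separate features from the target in each partition
X_train, y_train = train.drop(columns="RainTomorrow"), train["RainTomorrow"]
X_test, y_test = test.drop(columns="RainTomorrow"), test["RainTomorrow"]

print(len(train), len(test))
```

scikit-learn's `train_test_split` does the same job and can additionally stratify on the target, which helps when rainy days are much rarer than dry ones.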

Please see Code part 2 for ridge, lasso, PCA and logistic regression.